8 research outputs found

    You can’t suggest that?! Comparisons and improvements of speller error models

    In this article, we study the correction of spelling errors, specifically how spelling errors are made and how we can model them computationally in order to fix them. The article describes two different approaches to generating spelling correction suggestions for three Uralic languages: Estonian, North Sámi and South Sámi. The first approach to modelling spelling errors is rule-based: experts write rules that describe the kinds of errors that are made, and these are compiled into a finite-state automaton that models the errors. The second is data-based: we show a machine learning algorithm a corpus of errors that humans have made, and it creates a neural network that models the errors. Both approaches require collecting error corpora and understanding their contents; therefore we also describe the actual errors we have seen in detail. We find that while both approaches produce error correction systems, with current resources the expert-built systems are still more reliable.
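
    As a hedged illustration of the first, rule-based approach, the sketch below applies hand-written substitution rules to a misspelling and keeps only the candidates found in a lexicon. The rule set, lexicon and function are invented for illustration; real GiellaLT spellers compile such rules into weighted finite-state automata rather than applying them in Python.

        # Minimal sketch of a rule-based speller error model (hypothetical;
        # not the authors' GiellaLT/HFST implementation).

        # Toy lexicon of accepted word forms, standing in for a full speller.
        LEXICON = {"giella", "sápmi", "goahti"}

        # Expert-written error rules: (observed substring, intended substring).
        RULES = [
            ("a", "á"),  # missing acute accent
            ("s", "š"),  # missing caron
        ]

        def suggest(misspelling: str) -> set[str]:
            """Apply each rule at every position; keep results found in the lexicon."""
            candidates = set()
            for wrong, right in RULES:
                start = 0
                while (i := misspelling.find(wrong, start)) != -1:
                    candidates.add(misspelling[:i] + right + misspelling[i + len(wrong):])
                    start = i + 1
            return candidates & LEXICON

        print(suggest("sapmi"))  # -> {'sápmi'}

    The data-based approach described in the abstract would instead learn such substitutions, and their weights, from an error corpus.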

    Mii *eai leat gal vuollánan -- Vi *ha neimen ikke gitt opp: En hybrid grammatikkontroll for å rette kongruensfeil

    Machine learning is the dominant paradigm in natural language processing nowadays. It requires vast amounts of manually annotated or synthetically generated text data. In the GiellaLT infrastructure, on the other hand, we have worked with rule-based methods, where linguists have full control over the development of the tools. In this article we challenge the myth that machine learning is cheaper than a rule-based approach by showing how much work lies behind data generation, whether via corpus annotation or via tools that automatically mark up the corpus. We have shown earlier that the correction of grammatical errors, in particular compound errors, benefits from hybrid methods. Agreement errors, on the other hand, depend to a higher degree on the larger grammatical context. Our experiments show that machine learning methods for this error type, even when supplemented by rule-based methods generating massive amounts of data, cannot compete with the state-of-the-art rule-based approach.

    Machine learning techniques that make no use of linguistic expertise dominate language technology today. They require a large amount of data to be annotated manually in advance. In the GiellaLT infrastructure, by contrast, we have worked with rule-based methods in which the linguist controls how the tools work. The choice of method is not only technical: growing knowledge of Sámi grammar, quality assurance and verifiability (the tools do what they are supposed to do, also by human standards) underlie the preference for working rule-based. In this article we attempt to debunk the myth that machine learning is cheaper than rule-based methods. Nevertheless, we believe that machine learning methods can be useful where we want broader coverage in error correction. We show that machine learning models with access to only small amounts of data (in this case, for small languages) depend on good rule-based tools as a substitute for manual annotation.
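
    As a hedged sketch of the data-generation step the abstract alludes to, the snippet below corrupts subject-verb agreement to produce (erroneous, correct) training pairs. The paradigm table and names are invented, and English stands in for North Sámi; in the actual setup a rule-based morphological analyser would supply the forms rather than a hard-coded table.

        # Hypothetical sketch of synthetic agreement-error generation
        # (not the authors' GiellaLT pipeline); a toy English paradigm
        # stands in for the output of a rule-based analyser.

        import random

        random.seed(0)

        # Toy paradigm: (person, number) -> present-tense form of "give up".
        PARADIGM = {("3", "sg"): "gives up", ("3", "pl"): "give up",
                    ("1", "sg"): "give up", ("1", "pl"): "give up"}

        def make_error_pair(subject: str, feats: tuple, rest: str):
            """Return an (erroneous, correct) pair by swapping in a verb form
            that disagrees with the subject's person/number features."""
            correct = f"{subject} {PARADIGM[feats]} {rest}"
            wrong_forms = sorted({f for f in PARADIGM.values() if f != PARADIGM[feats]})
            erroneous = f"{subject} {random.choice(wrong_forms)} {rest}"
            return erroneous, correct

        print(make_error_pair("They", ("3", "pl"), "too easily"))
        # -> ('They gives up too easily', 'They give up too easily')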

    Recent advances in Apertium, a free/open-source rule-based machine translation platform for low-resource languages

    This paper presents an overview of Apertium, a free and open-source rule-based machine translation platform. Translation in Apertium happens through a pipeline of modular tools, and the platform continues to be improved as more language pairs are added. Several advances have been implemented since the last publication, including some new optional modules: a module that allows rules to process recursive structures at the structural transfer stage, a module that deals with contiguous and discontiguous multi-word expressions, and a module that resolves anaphora to aid translation. Also highlighted is the hybridisation of Apertium through statistical modules that augment the pipeline, and statistical methods that augment existing modules. This includes morphological disambiguation, weighted structural transfer, and lexical selection modules that learn from limited data. The paper also discusses how a platform like Apertium can be a critical part of access to language technology for so-called low-resource languages, which might be ignored or deemed unapproachable by popular corpus-based translation technologies. Finally, the paper presents some of the released and unreleased language pairs, concluding with a brief look at some supplementary Apertium tools that prove valuable to users as well as language developers. All Apertium-related code, including language data, is free/open-source and available at https://github.com/apertium.
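
    As a usage-level illustration of the pipeline, the snippet below pipes text through the apertium command-line tool from Python. It assumes the Apertium tools and an installed translation pair are available locally; es-en is used here only as an example pair name.

        # Sketch of driving the Apertium pipeline from Python (assumes the
        # apertium command and an es-en pair are installed; available pair
        # names depend on which packages you have).

        import subprocess

        def translate(text: str, pair: str = "es-en") -> str:
            """Pipe text through Apertium's analysis-transfer-generation pipeline."""
            result = subprocess.run(
                ["apertium", pair],
                input=text, capture_output=True, text=True, check=True,
            )
            return result.stdout.strip()

        print(translate("¿Cómo estás?"))  # e.g. 'How are you?'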